Statistics for Biology FS 2024

Week 1

Exercise 1

  1. No. Collectors did not choose randomly, but preferred rarer types over more common ones.

  2. It is a sample of convenience.

  3. There is bias in every year’s sample because rare types are over-represented. Furthermore, this bias might change across years, as the frequencies of the two morphs have changed over time.

Exercise 2

  1. Discrete.

  2. Technically it is a discrete variable (in a finite sample, the fraction of births can only take finitely many values), but it makes sense to treat it as a continuous variable if the sample is very large.

  3. Discrete. There are no half crimes.

  4. Continuous. Body mass is continuous and hence the log as well.

Exercise 3

  1. E(xplanatory): Altitude (categorical: high vs low) R(esponse): Growth rate S(tudy type): Observational

  2. E: Treatment (standard vs. tasploglutide) R: Rate of insulin release S: Experimental

  3. E: Health status (schizophrenia vs. healthy) R: Frequency of illegal drug use S: Observational

  4. E: Number of legs R: Survival probability S: Experimental

  5. E: Treatment (advanced communication therapy vs. social visits without formal therapy) R: Communication ability S: Experimental

Exercise 4

The main problem is a strong bias in the sample. To see this, consider the sample of planes that were used in this study. Only planes that were not hit in critical areas were available to estimate the distribution of bullet holes. Planes that were hit in a critical area, i.e., one that leads to a crash, were not available because they did not return to base. With this knowledge, it becomes clear that it would have been better to reinforce the areas where no, or very few, bullet holes were found, namely the cockpit and engine.

Exercise 5

  1. The population is all the small mammals of Kruger National Park; the parameter being estimated is the number of small-mammal species in the Park.

  2. No, the sample is not likely to be random. In a random sample, every individual has the same chance of being selected, but some small mammals might be easier to trap than others (for example, trapping only at night would miss mammals that are active only during the day). In a random sample, individuals are also selected independently; multiple animals caught in the same trap might not be independent if they are related or live near one another (this is harder to judge).

  3. The number of species in the sample might underestimate the number in the Park if sampling was not random (e.g., if daytime mammals were missed), or if rare species happened to avoid capture. Even if the sample is perfectly random, the estimate will be too low in most cases: you can miss species just by chance, but you can never observe more species than there actually are. Thus, on average, you will underestimate the true number.
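This downward bias is easy to see in a small simulation. The sketch below (in Python rather than R, purely as an illustration) assumes a hypothetical community of 10 species with made-up, unequal relative abundances and repeatedly draws samples of 30 individuals:

```python
import random

random.seed(1)

# hypothetical community: 10 species with made-up, unequal relative abundances
abundances = [0.3, 0.2, 0.15, 0.1, 0.08, 0.07, 0.05, 0.03, 0.015, 0.005]
true_richness = len(abundances)

reps, sample_size = 2000, 30
richness = []
for _ in range(reps):
    # sample individuals with replacement, weighted by abundance
    sample = random.choices(range(true_richness), weights=abundances, k=sample_size)
    richness.append(len(set(sample)))  # number of species actually observed

avg_richness = sum(richness) / reps
# a sample can never contain more species than exist, so the estimate
# is biased downwards: avg_richness < true_richness
```

Rare species (here the one with relative abundance 0.005) are missed in most samples, which is exactly why the observed species count underestimates the true number on average.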

Week 2

Exercise 6

  1. You should see a blank plot. This is what ggplot() does: it simply creates a “canvas” to which you can add “layers”.

  2. How many rows are in mpg? How many columns?

str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

There are 234 observations of 11 variables. Thus, there are 11 columns and 234 rows. We can confirm this by looking at the dimensions of the underlying data frame:

dim(mpg)
## [1] 234  11
?mpg

drv is a categorical variable indicating the drive type: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive

ggplot(data  = mpg) + 
  geom_point(mapping = aes(y = hwy,x = cyl)) + 
  theme_classic() + 
  labs(title = "A scatterplot", y = "miles per gallon on highways",x="cylinders")

ggplot(data  = mpg) + 
  geom_point(mapping = aes(x = class,y = drv)) + 
  theme_classic() + 
  labs(title = "A useless plot", x = "class",y="drive type")

Both drv and class are categorical variables, and it does not make sense to visualize them this way. What would be a better way to show these data?

  1. Color is used within the aesthetics function aes(), where we specify which variables should be mapped to which aesthetic. “blue” is not a variable in our data frame, so ggplot treats it as a constant categorical value and maps it to the default colour scale (the points come out red, with a legend entry labelled “blue”). To actually get blue points, color has to be set outside of aes(). The following code creates the correct plot:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

  2. A continuous variable will lead to a continuous color gradient rather than a discrete set of colors.
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = cty, y = hwy, color = displ))

  3. The “+” symbol has to come at the end of a line, not at the beginning of the next line:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

Week 3

Exercise 7

genes <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter04/chap04e1HumanGeneLengths.csv"))

# a) 

n = dim(genes)[1]
print(n)
## [1] 20290
# b)    
sum.gene.lengths = sum(genes$geneLength)

# c)      

mean.gene.length1 = sum.gene.lengths/n
mean.gene.length2 = mean(genes$geneLength)

print(mean.gene.length1)
## [1] 2622.027
print(mean.gene.length2)
## [1] 2622.027
# d)
square.sum.gene.lengths = sum(genes$geneLength^2)
print(square.sum.gene.lengths)
## [1] 223678143206
# e)    

# by hand:
variance.gene.length = sum((genes$geneLength - mean.gene.length1)^2)/(n-1)

print(variance.gene.length)
## [1] 4149235
# for comparison:
var(genes$geneLength)
## [1] 4149235
# f)    
sd(genes$geneLength)
## [1] 2036.967
# or 

sd.gene.length = sqrt(variance.gene.length)

# g)     

cv.gene.length = sd.gene.length/mean.gene.length1
print(cv.gene.length)
## [1] 0.7768672
# h)    
# Which one do you like best?

ggplot(data = genes) + 
  geom_histogram(mapping = aes(x = geneLength),color="black",fill="darkred") + 
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = genes) + 
  geom_density(mapping = aes(x = geneLength),color="black",fill="darkred") + 
  xlab("Gene length") +
  ylab("Density") +
  scale_x_log10() +
  theme_classic()

ggplot(data = genes) + 
  geom_boxplot(mapping = aes(y = geneLength),color="black",fill="darkred") + 
  theme_classic() + 
  coord_flip()

Week 4

Exercise 8

# load the data file:
stickleback <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter03/chap03e3SticklebackPlates.csv"))

# one possible solution:

mean.mm = mean(stickleback$plates[stickleback$genotype=="mm"])
mean.mM = mean(stickleback$plates[stickleback$genotype=="Mm"])
mean.MM = mean(stickleback$plates[stickleback$genotype=="MM"])

median.mm = median(stickleback$plates[stickleback$genotype=="mm"])
median.mM = median(stickleback$plates[stickleback$genotype=="Mm"])
median.MM = median(stickleback$plates[stickleback$genotype=="MM"])

par(mfrow = c(1,3))

hist(stickleback$plates[stickleback$genotype=="MM"],
     main = "MM",xlab = "number of plates",
     col="darkred")

abline(v = mean.MM,col="black")
abline(v = median.MM,col="blue")

hist(stickleback$plates[stickleback$genotype=="Mm"],
     main = "Mm",xlab = "number of plates",
     col="darkred")
abline(v = mean.mM,col="black")
abline(v = median.mM,col="blue")

hist(stickleback$plates[stickleback$genotype=="mm"],
     main = "mm",xlab = "number of plates",
     col="darkred")
abline(v = mean.mm,col="black")
abline(v = median.mm,col="blue")

# alternatively, we can first create a data frame 
# with the means per genotype
# check out the help function of ddply
# (this is only one way to do this, you can also do it by 
# "hand" or using other functions such as tapply) :

library(plyr)
df.mean = ddply(stickleback, "genotype", summarize, m.number = mean(plates))
df.median = ddply(stickleback, "genotype", summarize, m.number = median(plates))

# then we plot it with a single ggplot
ggplot() +
  geom_histogram(data = stickleback, mapping = aes(x=plates),binwidth=5,color="black",fill="darkred") +
  geom_vline(data = df.mean, aes(xintercept=m.number),col="black",linewidth=1) +
  geom_vline(data = df.median, aes(xintercept=m.number),col="blue",linewidth=1) +
  facet_wrap(~ genotype) +
  ylab("frequency") +
  theme_classic()

Exercise 9

The code simulates the sampling distribution of the mean. Here is an alternative version for the Poisson distribution:

# We choose the Poisson distribution with mean lambda = 50
lambda = 50

# We do 1000 replicates
reps = 1000

# We choose the sample size n to be 10
sample_size = 10

# the mean of each sample will be stored in this vector
means = vector("numeric",reps)

# in this loop, we calculate the mean of each sample
for (i in 1:reps)
{
  # take a sample from the Poisson distribution
  sample = rpois(sample_size,lambda)
  
  # store the mean in our vector
  means[i] = mean(sample)
}

# a histogram - note that we set freq = FALSE to 
# show relative frequencies 
hist(means,freq=F,main="The sampling distribution of the mean")

# we specify the x values for the theoretical curve
x = seq(0,3*lambda,by=0.1)

# we calculate the standard error:
SE = sqrt(lambda)/sqrt(sample_size)

# calculate the normal approximation to the sampling distribution (CLT)
y = dnorm(x,lambda,SE)

lines(x,y)
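The same experiment can be cross-checked outside R. The sketch below (Python, standard library only; the Poisson sampler is Knuth's algorithm, since Python has no built-in one) confirms that the standard deviation of the sample means is close to \(\sqrt{\lambda}/\sqrt{n}\):

```python
import math
import random
import statistics

random.seed(2024)

def rpois(lam):
    # Knuth's algorithm: multiply uniforms until the product drops below exp(-lam)
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while p > limit:
        k += 1
        p *= random.random()
    return k - 1

lam, sample_size, reps = 50, 10, 5000
means = [statistics.mean([rpois(lam) for _ in range(sample_size)])
         for _ in range(reps)]

grand_mean = statistics.mean(means)      # should be close to lam = 50
se_empirical = statistics.stdev(means)   # should be close to sqrt(lam / n)
se_theory = math.sqrt(lam / sample_size)
```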

Week 5

Exercise 10

  1. What is the probability that a randomly drawn slice has pepperoni on it?

\[P(\text{pepperoni}) = 5/8 = 0.625 = 62.5 \%\]

  2. What is the probability that a randomly drawn slice has both pepperoni and anchovies on it?

\[P(\text{pepperoni} \quad \text{and} \quad \text{anchovies}) = 2/8 = 0.25 = 25 \%\]

  3. What is the probability that a randomly drawn slice has either pepperoni or anchovies on it?

\[P(\text{pepperoni} \quad \text{or} \quad \text{anchovies}) = 7/8 = 0.875 = 87.5\%\]

  4. Are pepperoni and anchovies mutually exclusive on this pizza?

No. There are two slices with both pepperoni and anchovies.

  5. Are olives and mushrooms mutually exclusive on this pizza?

Yes. There is no slice that has both olives and mushrooms on it.

  6. Are getting mushrooms and getting anchovies independent when randomly picking a slice of pizza?

No, because

\[P(\text{anchovy}) = 4/8 = 0.5\]

\[P(\text{mushroom}) = 3/8 = 0.375\]

and

\[P(\text{mushroom} \quad \text{and} \quad \text{anchovy}) = 1/8\]

but

\[P(\text{mushroom})*P(\text{anchovy}) = 3/8 * 4/8 = 3/16 = 0.1875 = 18.75 \% .\]

  7. If I pick a slice from this pizza and tell you that it has olives on it, what is the chance that it also has anchovies?

\[P(\text{anchovy} \mid \text{olive}) =\] \[P(\text{anchovy} \quad \text{and} \quad \text{olive})/P(\text{olive}) = (1/8) / (2/8) = 1/2 = 0.5 = 50\%\]

  8. If I pick a slice from this pizza and tell you that it has anchovies on it, what is the chance that it also has olives?

\[P(\text{olive} \mid \text{anchovy}) =\] \[P(\text{olive} \quad \text{and} \quad \text{anchovy})/P(\text{anchovy}) = (1/8) / (4/8) = 1/4 = 0.25 = 25\%\]

  9. Seven of your friends each choose a slice at random and eat them without telling you what toppings they had. What is the chance that the last slice has olives on it?

\[P(\text{last slice has olives}) = 2/8 = 0.25 = 25\%.\]

This can be seen directly because each slice has the same probability to be the last slice.

  10. You choose two slices at random. What is the chance that they both have olives on them? (Be careful: after removing the first slice, the probability of choosing one of the remaining slices changes.)

\[P(first \, slice \,has \, olives) = 2/8\]

\[P(second \, slice \, also \, has \, olives) = 1/7\]

\[P(both \, slices \, have \, olives) = 2/8 * 1/7 = 1/28 \approx 3.6\% \]

  11. What is the probability that a randomly chosen slice does not have pepperoni on it?

\[P(\text{no pepperoni}) = 1 - P(\text{pepperoni}) = 3/8 = 37.5\%\]

  12. Draw a pizza for which mushrooms, olives, anchovies and pepperoni are all mutually exclusive.

[Figure: a pizza for which mushrooms, olives, anchovies and pepperoni are all mutually exclusive]
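The numerical answers above can be verified mechanically. The slice layout below is a hypothetical reconstruction: one assignment of toppings to the 8 slices that is consistent with every probability used in this exercise (a quick Python check, independent of the R code in this document):

```python
from fractions import Fraction

# hypothetical 8-slice layout consistent with all probabilities above
slices = [
    {"pep", "mush"}, {"pep", "mush"}, {"pep", "anch"}, {"pep", "anch"},
    {"pep", "olv"}, {"anch", "olv"}, {"anch", "mush"}, set(),
]
n = len(slices)

def p(*toppings):
    # probability that a randomly drawn slice carries all listed toppings
    return Fraction(sum(1 for s in slices if set(toppings) <= s), n)

p_pep = p("pep")                                # 5/8
p_pep_or_anch = Fraction(sum(1 for s in slices if s & {"pep", "anch"}), n)  # 7/8
p_anch_given_olv = p("anch", "olv") / p("olv")  # 1/2
p_olv_given_anch = p("olv", "anch") / p("anch") # 1/4
p_two_olives = Fraction(2, 8) * Fraction(1, 7)  # sampling without replacement
```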

Exercise 11

  1. What is the probability that you picked up no dangerous snakes?

\[P(no \, dangerous \, snakes) = \] \[ P(no \, dangerous \, snake \, in \, left \, hand) P(no \, dangerous \, snake \, in \, right \, hand)\] \[= 3/8 * 2/7 = 6/56 = 0.107 = 10.7 \%\]

  2. Assume that any dangerous snake that you pick up has a probability of 0.8 of biting you. The defanged snakes do not bite. What is the chance that, in picking up your two snakes, you are bitten at least once?

\[P(bite) = P(bite \mid 0 \, dangerous \, snakes) P(0 \, dangerous \, snakes)\] \[+ P(bite \mid 1 \, dangerous \, snake) P(1 \, dangerous \, snake) \] \[+P(bite \mid 2 \, dangerous \, snakes) P(2 \, dangerous \, snakes).\]

From (a) we know that: \[P(0 \, dangerous \, snakes) = 0.107.\]

Next we calculate: \[P(1 \, dangerous \, snake) = (5/8 * 3/7) + (3/8 * 5/7) = 0.536\] \[P(2 \, dangerous \, snakes) = 5/8 * 4/7 = 0.357\] \[P(bite \mid 0 \, dangerous \, snakes) = 0\] \[P(bite \mid 1 \, dangerous \, snake) = 0.8\] \[P(bite \mid 2 \, dangerous \, snakes) = 1- P(no \, bite \mid 2 \, dangerous \, snakes) = 1-(1-0.8)^2 = 0.96\]

Putting all this together: \[P(bite) = (0*0.107)+(0.8*0.536)+(0.96*0.357) = 0.772\]

  3. If you picked up one snake and it didn’t bite you, what is the probability that it is defanged?

\[P(defanged \mid no \, bite)= \left[P(no \, bite \mid defanged)P(defanged)\right]/P(no \, bite)\] \[P(no \, bite \mid defanged)=1\] \[P(defanged) = 3/8\]

\[P(no \, bite) = P(defanged)P(no \, bite \mid defanged) + P(dangerous)P(no \, bite \mid dangerous) = (3/8 * 1) + (5/8 * (1-0.8)) = 0.5\]

Therefore

\[P(defanged \mid no \, bite) = (1 * 3/8)/0.5 = 0.75 = 75\%.\]
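As a numerical cross-check of (a) through (c), the same quantities can be computed exactly with binomial coefficients (a Python sketch, independent of the R code in this document):

```python
from math import comb

dangerous, defanged = 5, 3
total = dangerous + defanged
p_bite_each = 0.8  # a dangerous snake bites with probability 0.8

# (a) number of dangerous snakes among the two picked up (hypergeometric)
p0 = comb(defanged, 2) / comb(total, 2)     # 3/28  ≈ 0.107
p1 = dangerous * defanged / comb(total, 2)  # 15/28 ≈ 0.536
p2 = comb(dangerous, 2) / comb(total, 2)    # 10/28 ≈ 0.357

# (b) law of total probability for at least one bite
p_bite = p1 * p_bite_each + p2 * (1 - (1 - p_bite_each) ** 2)

# (c) Bayes' rule for a single snake that did not bite
p_defanged = defanged / total
p_no_bite = p_defanged * 1 + (dangerous / total) * (1 - p_bite_each)
p_defanged_given_no_bite = p_defanged / p_no_bite
```

The tiny discrepancy between the exact 21.6/28 ≈ 0.771 and the 0.772 above comes purely from rounding the intermediate probabilities to three decimals.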
        
        

Exercise 12

  1. What is the probability that all five researchers have calculated an interval that includes the true value of the parameter?

\(0.95^5 = 0.774\)

  2. What is the probability that at least one does not include the true parameter value?

\(1- 0.95^5 = 0.226\)
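Both numbers are easy to confirm exactly and by simulation (a Python sketch; each interval is assumed to cover the true value independently with probability 0.95):

```python
import random

random.seed(7)

reps, n_researchers, coverage = 100_000, 5, 0.95
all_cover = 0
for _ in range(reps):
    # each researcher's interval covers the true value with probability 0.95
    covers = [random.random() < coverage for _ in range(n_researchers)]
    all_cover += all(covers)

p_all = all_cover / reps          # should be close to 0.95**5 ≈ 0.774
p_at_least_one_miss = 1 - p_all   # should be close to 0.226
```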

Week 6

Exercise 13

We first calculate the mean for each genotype:

stickleback = read.csv("~/Dropbox/Teaching/StatisticsForBiology/chap03e3SticklebackPlates.csv")

mean.mm = mean(stickleback$plates[stickleback$genotype=="mm"])
mean.Mm = mean(stickleback$plates[stickleback$genotype=="Mm"])
mean.MM = mean(stickleback$plates[stickleback$genotype=="MM"])

Next, the lower and upper limits of the 95% CI, obtained via t.test():

t.t.mm = t.test(stickleback$plates[stickleback$genotype=="mm"])
t.t.Mm = t.test(stickleback$plates[stickleback$genotype=="Mm"])
t.t.MM = t.test(stickleback$plates[stickleback$genotype=="MM"])

CI.mm = t.t.mm$conf.int
CI.Mm = t.t.Mm$conf.int
CI.MM = t.t.MM$conf.int

We put it all in a dataframe:

df = data.frame(means = c(mean.mm,mean.Mm,mean.MM),
                lower = c(CI.mm[1],CI.Mm[1],CI.MM[1]),
                upper = c(CI.mm[2],CI.Mm[2],CI.MM[2]),
                genotypes = c("mm","Mm","MM"))

Now we can plot everything:

ggplot(data = df) +
  geom_errorbar(mapping = aes(x = genotypes,ymin=lower, ymax=upper), width=.1) +
  geom_point(mapping = aes(x = genotypes,y=means)) 

If you prefer (I don’t), you can also make a bar plot:

ggplot(data = df) +
  geom_col(mapping = aes(x = genotypes,y=means,fill=genotypes)) +
  geom_errorbar(mapping = aes(x = genotypes,ymin=lower, ymax=upper), width=.1) +
  geom_point(mapping = aes(x = genotypes,y=means))

Either way, it is advisable to also show the raw data:

ggplot(data = df)  + 
  geom_jitter(data = stickleback,mapping = aes(x = genotype,y = plates,color=genotype),size=0.8,width=0.05)+
  geom_errorbar(mapping = aes(x = genotypes,ymin=lower, ymax=upper), width=.2) +
  geom_point(mapping = aes(x = genotypes,y=means))

Exercise 14

We first choose values for \(\mu\) and \(\sigma\) (you can choose any values you want):

mu = 10
sigma = 2

Next we choose the values for the x axis. A good choice is to cover the range from \(\mu - 4 \sigma\) to \(\mu + 4 \sigma\). We choose a step size of 0.1:

x = seq(mu-4*sigma,mu+4*sigma,by=0.1)

Now we compute the corresponding \(y\) values for our plot using the function dnorm:

y = dnorm(x, mean = mu, sd = sigma)

I now do the same thing for the standard normal distribution:

mu.std = 0
sigma.std = 1
x.std = seq(mu.std-4*sigma.std,mu.std+4*sigma.std,by=0.1)
y.std = dnorm(x.std, mean = 0, sd = 1)

And finally we plot both density functions next to each other:

par(mfrow = c(1,2))
plot(x,y,type="l",xlab="x",ylab="density")
plot(x.std,y.std,type="l",xlab="x",ylab="density")

We see that both curves look exactly the same. The only difference is the scale of the x and y axes.

random.sample = rnorm(1000,mu,sigma)
hist(random.sample,freq=F,ylim=c(0,0.2),breaks=30)
lines(x,y)

  1. Calculate the 10 percent quantile of the distribution. Store the value in a variable \(Q10\). Then calculate the proportion of numbers in your random sample from (b) that are smaller than or equal to \(Q10\). What do you find?
Q10 = qnorm(0.1,mean=mu,sd=sigma)
sum(random.sample<=Q10)/1000
## [1] 0.103

We find that approximately 10 percent of our random sample is smaller than or equal to \(Q10\). This is exactly the definition of a quantile!
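The same check works in any language that provides the normal quantile function. A Python sketch using statistics.NormalDist from the standard library, with the same \(\mu = 10\) and \(\sigma = 2\) as above:

```python
import random
from statistics import NormalDist

random.seed(3)

mu, sigma = 10, 2
nd = NormalDist(mu, sigma)

# the 10% quantile, analogous to qnorm(0.1, mean = mu, sd = sigma)
q10 = nd.inv_cdf(0.1)

sample = [random.gauss(mu, sigma) for _ in range(100_000)]
prop = sum(x <= q10 for x in sample) / len(sample)
# by the definition of a quantile, prop should be close to 0.1
```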

Exercise 15

  p = seq(0.05,0.95,by=0.05)
  Q = qnorm(p,mean=0,sd=1)
  C = pnorm(Q,mean=0,sd=1)
  plot(p,C)

We find that \(p = C\). This means that the cumulative distribution function is the inverse of the quantile function.

The p-quantile \(q_p\) is defined as the number such that
\[P(X < q_p) = p.\] We define a quantile function \(Q(p) = q_p\). The cumulative distribution function \(F(x)\) gives us the value \[F(x) = P(X<x)\] Then \(F(q_p) = P(X < q_p) = p\)

and because \(Q(p) = q_p\), it follows that

\[F(Q(p)) = F(q_p) = p.\]

Here is a graphical representation of this relationship:

cdf.std = pnorm(x.std,mean=0,sd=1)
plot(x.std,cdf.std,type="l",xlab="x",ylab = "P(X<x) ")
abline(v=qnorm(0.1,mean=0,sd=1),lwd=2,col="dodgerblue")
abline(h = 0.1,col="slategray",lwd=2)
text(-1.8,0.3,expression(Q[10]),col="dodgerblue")
text(-3.3,0.2,expression(P(X < Q[10])==0.1),col="slategray")
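The identity \(F(Q(p)) = p\) can also be checked numerically for a whole grid of probabilities (a Python sketch using the standard library's statistics.NormalDist):

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal: mean 0, sd 1

probs = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
# applying the CDF after the quantile function recovers p
roundtrip = [nd.cdf(nd.inv_cdf(p)) for p in probs]
max_error = max(abs(r - p) for r, p in zip(roundtrip, probs))
```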

  1. Indicate the smallest 5 percent of the distribution. In other words: Let \(X\) be a normally distributed random variable. For which range \((-\infty,c_1]\) do we get \(P(X \leq c_1) = 0.05\)?
mu.std = 0
sigma.std = 1
x.std = seq(mu.std-4*sigma.std,mu.std+4*sigma.std,by=0.1)
y.std = dnorm(x.std, mean = 0, sd = 1)
plot(x.std,y.std,type="l")
c1 = qnorm(0.05,0,1)
abline(v = c1)

  2. Indicate the most extreme 5 percent of the distribution. In other words: For which range \((-\infty,c_1] \cup [c_2,\infty)\) do we get \(P(X \leq c_1 \text{ or } X \geq c_2) = 0.05\)? [Hint: you can use the quantile function of \(R\) and the symmetry of the normal distribution to simplify things]
mu.std = 0
sigma.std = 1
x.std = seq(mu.std-4*sigma.std,mu.std+4*sigma.std,by=0.1)
y.std = dnorm(x.std, mean = 0, sd = 1)
plot(x.std,y.std,type="l")
c2 = qnorm(c(0.025,0.975),0,1)
abline(v=c2)

We can add some simulated data if we want:

set.seed(1984)
n = 1000
random.sample = rnorm(n,0,1)
sum(random.sample < c2[1] | random.sample > c2[2])/n
## [1] 0.052
plot(x.std,y.std,type="l")
rug(random.sample)
rug(random.sample[random.sample < c2[1] | random.sample > c2[2]],col="red")

We see that roughly 5 percent fall into the extreme tails of the distribution. This is sensible, because we defined the extreme part as the range \((-\infty,c_2[1]] \cup [c_2[2],\infty)\), where c2[1] and c2[2] are the 2.5% and 97.5% quantiles of the distribution.
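These cutoffs are the familiar ±1.96 of the standard normal distribution. A quick cross-check in Python (using statistics.NormalDist from the standard library):

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal

lower_cut = nd.inv_cdf(0.025)  # ≈ -1.96, analogous to qnorm(0.025)
upper_cut = nd.inv_cdf(0.975)  # ≈ +1.96, analogous to qnorm(0.975)

# the two tails together contain exactly 5% of the distribution
tail_mass = nd.cdf(lower_cut) + (1 - nd.cdf(upper_cut))
```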

Week 7

Exercise 16

Remember that pnorm(x) = P(Z < x) and that 1 - pnorm(x) = P(Z > x).

  1. P(Z > 1.34)
x = seq(-4,4,by=0.01)
y = dnorm(x,0,1)
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = 1.34,col = "red")
text(2.3,0.05,"P(Z > 1.34)")

prob = 1 - pnorm(1.34,0,1)
prob
## [1] 0.09012267
  2. P(Z < 1.34)
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = 1.34,col = "red")
text(0,0.05,"P(Z < 1.34)")

prob = pnorm(1.34,0,1)
prob
## [1] 0.9098773
  3. P(Z < 2.15)
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = 2.15,col = "red")
text(0,0.05,"P(Z < 2.15)")

prob = pnorm(2.15,0,1)
prob
## [1] 0.9842224
  4. P(Z < 1.2)
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = 1.2,col = "red")
text(0,0.05,"P(Z < 1.2)")

prob = pnorm(1.2,0,1)
prob
## [1] 0.8849303
  5. P(0.52 < Z < 2.34)
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = 2.34,col = "red")
abline(v = 0.52,col = "red")

text(1.5,0.3,"P(0.52 < 
Z < 
2.34)")

prob = pnorm(2.34,0,1) - pnorm(0.52,0,1)
prob
## [1] 0.2918899
  6. P(-2.34 < Z < -0.52)
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = -2.34,col = "red")
abline(v = -0.52,col = "red")

text(-1.5,0.3,"P(-2.34 < 
Z < 
-0.52)")

prob = pnorm(-0.52,0,1) - pnorm(-2.34,0,1) 
prob
## [1] 0.2918899
  7. P(Z < -0.93)
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = -0.93,col = "red")

text(-2,0.3,"P(Z < -0.93)")

prob = pnorm(-0.93,0,1) 
prob
## [1] 0.1761855
  8. P(-1.57 < Z < -0.32)
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = -1.57,col = "red")
abline(v = - 0.32,col = "red")

text(-1.5,0.3,"P(-1.57 < 
Z < 
- 0.32)")

prob = pnorm(-0.32,0,1) - pnorm(-1.57,0,1) 
prob
## [1] 0.3162766
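As a cross-check, these probabilities can be computed with any implementation of the standard normal CDF. Here is a Python spot check of (a), (e), and the symmetry that makes (e) and (f) equal:

```python
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal CDF, the analogue of pnorm(x, 0, 1)

p_a = 1 - Phi(1.34)            # P(Z > 1.34)         ≈ 0.0901
p_e = Phi(2.34) - Phi(0.52)    # P(0.52 < Z < 2.34)  ≈ 0.2919
p_f = Phi(-0.52) - Phi(-2.34)  # P(-2.34 < Z < -0.52)
# by the symmetry of the normal distribution, p_f equals p_e
```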

Exercise 17

We first calculate the mean for each genotype:

stickleback = read.csv("~/Dropbox/Teaching/StatisticsForBiology/chap03e3SticklebackPlates.csv")

mean.mm = mean(stickleback$plates[stickleback$genotype=="mm"])
mean.Mm = mean(stickleback$plates[stickleback$genotype=="Mm"])
mean.MM = mean(stickleback$plates[stickleback$genotype=="MM"])

Next we do the same thing for the standard deviation:

sd.mm = sd(stickleback$plates[stickleback$genotype=="mm"])
sd.Mm = sd(stickleback$plates[stickleback$genotype=="Mm"])
sd.MM = sd(stickleback$plates[stickleback$genotype=="MM"])

And now we can make three plots, one for each genotype:

x = seq(0,100,by = 0.01)

hist(stickleback$plates[stickleback$genotype=="mm"],freq=F,
     main =  "Genotype: mm",xlab = "Plates",ylab  = "Frequency ")
y = dnorm(x,mean.mm,sd.mm)
lines(x,y,lwd=2,col="red")

hist(stickleback$plates[stickleback$genotype=="Mm"],freq=F,
     main =  "Genotype: Mm",xlab = "Plates",ylab  = "Frequency ",
     xlim=c(0,100))
y = dnorm(x,mean.Mm,sd.Mm)
lines(x,y,lwd=2,col="red")

hist(stickleback$plates[stickleback$genotype=="MM"],freq=F,
     main =  "Genotype: MM",xlab = "Plates",ylab  = "Frequency ",
     xlim=c(0,100))
y = dnorm(x,mean.MM,sd.MM)
lines(x,y,lwd=2,col="red")

It can be seen clearly from the images that the normal distribution is not a very good approximation to reality in this case, in particular for the heterozygotes.

  1. We use the function t.test to calculate the upper and lower limit of the 95% CI:
t.t.mm = t.test(stickleback$plates[stickleback$genotype=="mm"])
t.t.Mm = t.test(stickleback$plates[stickleback$genotype=="Mm"])
t.t.MM = t.test(stickleback$plates[stickleback$genotype=="MM"])

CI.mm = t.t.mm$conf.int
CI.Mm = t.t.Mm$conf.int
CI.MM = t.t.MM$conf.int

We put it all in a dataframe to make comparison easier:

df = data.frame(means = c(mean.mm,mean.Mm,mean.MM),
                lower = c(CI.mm[1],CI.Mm[1],CI.MM[1]),
                upper = c(CI.mm[2],CI.Mm[2],CI.MM[2]),
                genotypes = c("mm","Mm","MM"))

Now we calculate the standard error of the mean for each genotype. We can use the standard deviation we have calculated in (a) for this. First we need the sample size for each group:

n.mm = sum(stickleback$genotype=="mm")
n.Mm = sum(stickleback$genotype=="Mm")
n.MM = sum(stickleback$genotype=="MM")

Then we can directly calculate the standard error:

SE.mm = sd.mm/(sqrt(n.mm))
SE.Mm = sd.Mm/(sqrt(n.Mm))
SE.MM = sd.MM/(sqrt(n.MM))

Adding the results to the data frame:

lower.approx = c(mean.mm-2*SE.mm,mean.Mm-2*SE.Mm,mean.MM-2*SE.MM)
upper.approx = c(mean.mm+2*SE.mm,mean.Mm+2*SE.Mm,mean.MM+2*SE.MM)

df.2 = mutate(df, lower.2SE = lower.approx,
              upper.2SE = upper.approx)

print(df.2)

We can see that the two approaches give very similar limits. The mean ± 2 SE rule is only an approximation: the exact multiplier used by t.test() is the 97.5% quantile of the t distribution with n - 1 degrees of freedom, which is close to 2 for the group sizes in this data set.

Week 8

Exercise 18

geneContent <- read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter08/chap08e4XGeneContent.csv"))
head(geneContent)
##   chromosome
## 1          X
## 2          X
## 3          X
## 4          X
## 5          X
## 6          X
geneContent$chromosome <- factor(geneContent$chromosome, levels = c("X","NotX"))


geneContentTable <- table(geneContent$chromosome)
data.frame(Frequency = (geneContentTable))
##   Frequency.Var1 Frequency.Freq
## 1              X            781
## 2           NotX          19509
chisq.test( geneContentTable, p = c(1055, 19235)/20290 )
## 
##  Chi-squared test for given probabilities
## 
## data:  geneContentTable
## X-squared = 75.065, df = 1, p-value < 2.2e-16
library(binom)
binom.confint(781, n = 20290, method = "ac")
##          method   x     n       mean      lower      upper
## 1 agresti-coull 781 20290 0.03849187 0.03592951 0.04122894
1055/20290
## [1] 0.05199606

Interpretation:

Our test yields a very small \(p\)-value, indicating that the null hypothesis

\(H_0:\) The proportion of genes on the X chromosome is the same as the proportion of base pairs on the X chromosome.

can be rejected in favor of the alternative hypothesis

\(H_A:\) The proportion of genes on the X chromosome is not the same as the proportion of base pairs on the X chromosome.
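As a cross-check of the chisq.test() result, the \(\chi^2\) statistic can be recomputed by hand from the observed counts and expected proportions (a Python sketch of \(\sum_i (O_i - E_i)^2 / E_i\)):

```python
observed = [781, 19509]     # genes on X, genes not on X
total_genes = sum(observed) # 20290
expected_props = [1055 / 20290, 19235 / 20290]

# expected counts under H0: genes distributed in proportion to base pairs
expected = [p * total_genes for p in expected_props]  # 1055 and 19235
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# matches the X-squared = 75.065 reported by chisq.test()
```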

Exercise 19

birthDay <- read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter08/chap08e1DayOfBirth.csv"))
head(birthDay)
##      day
## 1 Sunday
## 2 Sunday
## 3 Sunday
## 4 Sunday
## 5 Sunday
## 6 Sunday

Put the days of the week in the correct order for tables and graphs.

birthDay$day <- factor(birthDay$day, levels = c( "Monday",
    "Tuesday", "Wednesday","Thursday","Friday","Saturday","Sunday"))

birthDayTable <- table(birthDay$day)
data.frame(Frequency = addmargins(birthDayTable))
##   Frequency.Var1 Frequency.Freq
## 1         Monday             41
## 2        Tuesday             63
## 3      Wednesday             63
## 4       Thursday             47
## 5         Friday             56
## 6       Saturday             47
## 7         Sunday             33
## 8            Sum            350

Bar graph of the birth data. The argument cex.names = 0.8 shrinks the names of the weekdays to 80% of the default size so that they fit in the graphics window – otherwise one or more names may be dropped.

barplot(birthDayTable, cex.names = 0.8)

shortNames = substr(names(birthDayTable), 1, 3)
barplot(table(birthDay$day), names = shortNames,
ylab = "Frequency", las = 1, col = "firebrick")

\(\chi^2\) goodness-of-fit test. The vector p is the expected proportions rather than the expected frequencies, and they must sum to 1 (R nevertheless uses the expected frequencies when calculating the test statistic).

chisq.test(birthDayTable, p = rep(1/7,times=7))
## 
##  Chi-squared test for given probabilities
## 
## data:  birthDayTable
## X-squared = 15.24, df = 6, p-value = 0.01847

We find a small \(p\)-value (< 0.05), so we reject the null hypothesis

\(H_0:\) birthdays are uniformly distributed across weekdays

in favor of the alternative hypothesis

\(H_A:\) birthdays are not uniformly distributed across weekdays.

From inspecting the data, it appears that births are less common on Sundays. One reason could be that caesarean deliveries are usually not scheduled on weekends. Another could be that hospitals are sometimes understaffed on weekends, so there may be a tendency to wait a bit longer before inducing labor in non-critical cases.

Note, however, that the p-value is not extremely small. We would reject the null hypothesis at \(\alpha = 0.05\) but not at \(\alpha = 0.01\). This illustrates that the decision to reject or not reject a null hypothesis depends strongly on the degree of certainty you require.
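The reported statistic can likewise be recomputed by hand: under \(H_0\), each of the 7 weekdays has expected frequency 350/7 = 50 (a Python sketch, independent of R):

```python
observed = [41, 63, 63, 47, 56, 47, 33]  # Monday through Sunday
total = sum(observed)                    # 350 births
expected = total / len(observed)         # 50 per weekday under H0

chi2 = sum((o - expected) ** 2 / expected for o in observed)
# matches the X-squared = 15.24 reported by chisq.test()
```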

Week 9

Exercise 20

stalkie <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter11/chap11e2Stalkies.csv"))

We first transform the data to standard normal form. We achieve this by subtracting the mean (to get a mean of 0) and then dividing by the standard deviation. The latter essentially changes the “unit” to standard deviations. As an analogy, think of going from centimeters to meters: you divide by 100 because 1 meter = 100 cm.

M <- mean(stalkie$eyespan)
sigma <- sd(stalkie$eyespan)

z = (stalkie$eyespan -M)/sigma

print(z)
## [1] -0.22037998 -1.57613533  1.18558852  1.68772013  0.45749769 -0.32080631
## [7] -0.87315108  0.03068582 -0.37101947
print(round(mean(z), digits = 10))
## [1] 0
print(sd(z))
## [1] 1

Next we do the same thing, except that now we measure the distance from the mean under the null hypothesis (9 mm) in units of standard errors:

m0 = 9
n <- length(stalkie$eyespan)

SE <- sigma / sqrt(n)
test.statistic <- (M - m0) / SE
print(test.statistic)
## [1] -1.673772

This means that the observed mean is approximately 1.7 standard errors below the value specified by the null hypothesis. Is this a lot or not? We can get a better understanding by calculating the p-value and illustrating our result:

p.value = 2 * pt(-abs(test.statistic),df=8)
print(p.value)
## [1] 0.1327105
x=seq(-4,4,by=0.01)
y = dt(x,df=8)
plot(x,y,type="l")

abline(v = test.statistic,col="dodgerblue")
text(-3,0.2,"test statistic",col="dodgerblue")
text(2,0.3,"t-distribution, 
     with 8 degrees of freedom")

x1 = seq(-4,test.statistic,by=0.01)
y1 = dt(x1,df=8)

x2 = rev(x1)
y2 = y1*0

polygon(c(x1,x2),c(y1,y2),col = "red")
polygon(-c(x1,x2),c(y1,y2),col = "red")
text(-3.5,0.05,"P value",col="red")

For comparison: the p-value obtained using the t.test() function:

test = t.test(stalkie$eyespan-9)

print(test$p.value)
## [1] 0.1327105

and the test statistic is

print(test$statistic)
##         t 
## -1.673772

We can summarize this in a table:

dat = data.frame(pValues = c(p.value,test$p.value),tStatistics = c(test.statistic,test$statistic))
colnames(dat) = c("P Value","Test Statistic")
rownames(dat) = c("by hand","using t.test()")
kableExtra::kable(dat) 
                 P Value    Test Statistic
by hand          0.1327105  -1.673772
using t.test()   0.1327105  -1.673772

Exercise 21

data = c(58.9, 7.8, 108.6, 44.8, 11.1, 19.2, 61.9, 30.5, 12.7, 35.8, 7.4, 39.3, 24.0, 62.1, 24.3, 55.3, 32.7, 65.3, -19.3, 7.6, -5.2, -2.1, 31.0, 69.0, 88.6, 39.5, 20.7, 89.0, 69.0, 64.9, 64.8)
  1. What is the sample size n?
n = length(data)

print(paste("The sample size is ",n))
## [1] "The sample size is  31"
  2. What is the mean of these data points? Remember to give the units.
m = mean(data)

print(paste("The mean shift is ",m, "meters."))
## [1] "The mean shift is  39.3290322580645 meters."
  3. What is the standard deviation of elevational range shift? (Give the units as well.)
sd = sd(data)

print(paste("The standard deviation of the shift is ",round(sd,digits=2)," meters."))
## [1] "The standard deviation of the shift is  30.66  meters."
  4. What is the standard error of the mean for these data?
SE = sd/sqrt(n)

print(paste("The standard error of the mean is ",round(SE,digits=2)," meters."))
## [1] "The standard error of the mean is  5.51  meters."
  5. How many degrees of freedom will be associated with a confidence interval and a one-sample t-test for the mean elevation shift?

There are 30 (= \(n-1\)) degrees of freedom.

We can check with the t.test() function:

tt = t.test(data)
tt
## 
##  One Sample t-test
## 
## data:  data
## t = 7.1413, df = 30, p-value = 6.057e-08
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  28.08171 50.57635
## sample estimates:
## mean of x 
##  39.32903
  6. Calculate the 95 % confidence interval for the mean using these data.

We can either use our results from the t-test before:

tt$conf.int
## [1] 28.08171 50.57635
## attr(,"conf.level")
## [1] 0.95

or

upper = m + 2*SE
lower = m - 2*SE

c(lower,upper)
## [1] 28.31452 50.34355

A more precise version:

upper = m + qt(0.975,df = n - 1)*SE
lower = m + qt(0.025,df = n - 1)*SE
c(lower,upper)
## [1] 28.08171 50.57635
  7. For the one-sample t-test, write down the appropriate null and alternative hypotheses.

\(H_0:\) The mean elevational shift is 0 meters.

\(H_A:\) The mean elevational shift is not 0 meters.

  8. Calculate the test statistic t for this test.

We can again just use t.test():

tt$statistic
##        t 
## 7.141309

or do it by hand:

mu.0 = 0 # as specified in the null hypothesis
t.stat = (m - mu.0)/SE
t.stat 
## [1] 7.141309
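As a cross-check, all of the by-hand quantities in this exercise can be reproduced from the raw shifts listed at the top (a Python sketch using only the standard library):

```python
import math
import statistics

shifts = [58.9, 7.8, 108.6, 44.8, 11.1, 19.2, 61.9, 30.5, 12.7, 35.8,
          7.4, 39.3, 24.0, 62.1, 24.3, 55.3, 32.7, 65.3, -19.3, 7.6,
          -5.2, -2.1, 31.0, 69.0, 88.6, 39.5, 20.7, 89.0, 69.0, 64.9, 64.8]

n = len(shifts)               # 31
m = statistics.mean(shifts)   # ≈ 39.33 m
s = statistics.stdev(shifts)  # sample standard deviation, ≈ 30.66 m
se = s / math.sqrt(n)         # ≈ 5.51 m
t = (m - 0) / se              # ≈ 7.14, testing H0: mean shift = 0
ci = (m - 2 * se, m + 2 * se) # rough 95% CI, ≈ (28.3, 50.3)
```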
  9. What assumptions are necessary to do a one-sample t-test?

The data should be approximately normally distributed. We can check this with a histogram

hist(data)

A slightly more precise way is to make a QQ plot:

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
qqPlot(data)

## [1]  3 19

As long as the points are within the dashed lines, you can safely do a t-test.